Abstract

Determining dense semantic correspondences across
objects and scenes is a difficult problem that underpins
many higher-level computer vision algorithms. Unlike
canonical dense correspondence problems which consider
images that are spatially or temporally adjacent,
semantic correspondence is characterized by images
that share similar high-level structures whose exact
appearance and geometry may differ.

Motivated by object recognition literature and recent
work on rapidly estimating linear classifiers, we
treat semantic correspondence as a constrained detection
problem, where an exemplar LDA classifier is
learned for each pixel. LDA classifiers have two distinct
benefits: (i) they exhibit higher average precision
than similarity metrics typically used in correspondence
problems, and (ii) unlike exemplar SVM, can output
globally interpretable posterior probabilities without
calibration, whilst also being significantly faster to train.

We pose the correspondence problem as a graphical
model, where the unary potentials are computed via
convolution with the set of exemplar classifiers, and
the joint potentials enforce smoothly varying
correspondence assignment.

1. Introduction 

Unlike canonical dense correspondence problems
which consider images that are spatially (stereo) or
temporally (optical flow) adjacent, semantic
correspondence is characterized by images that stem from the
same visual class (e.g. elephants, lammergeiers,
car-lined streets) whilst exhibiting individual appearance
and geometric properties.

For example, given two images of elephants (see
Figure 1), we would like to predict where each pixel on the
first elephant corresponds to on the second. This is
particularly challenging because the space of elephants
exhibits significant intra-class appearance and
geometric variation. A related problem is that of pose
estimation [14, 21], which considers a smaller fixed set
of landmarks stemming from a labelled dataset of a
known object class. From this dataset, one can learn
(i) the geometric dependency between landmarks, and
(ii) local detectors that discriminate the appearance of
each landmark from the background. When presented
with a new image, one can then estimate the landmark
locations by solving a graphical inference problem.

Figure 1. Dense semantic correspondence estimates how
points are related between images that stem from the same
visual class. Here, we wish to predict where each pixel on
the first elephant corresponds to on the second, whilst
being robust to appearance, pose and background variation.
The points labelled are representative of the dense
correspondence field estimated by our method.


Liu et al.’s seminal work on SIFT Flow [11]
established that a similar strategy could be applied for
estimating dense semantic correspondence between two
images stemming from the same semantic class. There
are three complicating factors however: (i) learning
geometric dependencies between landmarks is impossible
from only a single example, (ii) learning local
detectors is problematic due to the lack of positive training
samples, and (iii) computational complexity is a
major concern as we are treating each pixel coordinate
within the image as a landmark. Liu et al. proposed
to circumvent these problems by assuming the dense
geometric dependencies in an image can be adequately
governed by a variational regularizer, and that
accurate local detections between semantically similar
images can be attained through the $L_1$ distance between
SIFT descriptors. Since there is no learning required,
this can be performed in a computationally tractable
manner.

In this paper, we explore the possibility of
actually learning a discriminative detector at every pixel
coordinate in an image. Motivated by object
detection literature, we learn a linear classifier per pixel in
the reference image and apply it in a sliding-window
manner to the target image to produce a match
likelihood estimate. Learning a multitude of linear
detectors such as exemplar support vector machines (SVMs)
has typically had two issues: (i) each detector must
parse the negative set, often with hard-negative
mining techniques, leading to long training times, which
makes training a classifier for every pixel in an image
intractable, and (ii) since the scale of the outputs
depends on the margin, the output confidences of two
different SVMs are not directly comparable.

We leverage recent work on learning detectors
quickly with linear discriminant analysis (LDA), by
collecting negative statistics across a large number of
images in a pre-training phase. Learning a new
exemplar detector then involves a single matrix-vector
multiplication. Since LDA uses a generative model of the
class distributions, the posterior probabilities provide
a quantity that is comparable between detectors. This
allows us to estimate both the likelihood of matches
for each pixel individually, and also a global belief of
match quality.

2. Prior Art 

Canonical correspondence problems such as stereo
and optical flow typically rely on simple (dis-)similarity
metrics to describe the likelihood of two pixels
matching. In the original work of Horn and Schunck [6], this
was Euclidean distance on raw pixel intensities, which
manifested a brightness constancy assumption.


Since then, significant literature has focused on
determining robust metrics under increasingly adverse
conditions - from non-rigid deformations and
occlusions, to non-global intensity, contrast and
colorimetric changes [1, 13, 16, 17]. Importantly, however, all
of these works assume the images being observed stem
from the same underlying scene.

SIFT Flow first introduced the idea of semantic
correspondence across scenes [11]. While the method uses
a simple $L_1$ metric, the images are represented in dense
SIFT space typically associated with sparse keypoint
matching.^1 This sacrifices some localization accuracy
for improved geometric invariance. We maintain,
however, that similarity metrics are insufficient for
estimating the likelihood of pixels matching between different
scenes. Instead, we advocate the use of classifiers, as
per deformable face fitting and pose estimation
literature, except where a classifier is trained per pixel.

We leverage recent work on rapid estimation of LDA
classifiers [3, 19] to achieve this goal, though fast
correlation filter estimation [4] is potentially equally
applicable. The method we present is largely agnostic to the
objective used to learn the linear detectors (e.g. SVM,
LDA, correlation filters), however LDA classifiers are
attractive in producing globally interpretable outputs
across pixels, and requiring only a single matrix-vector
multiplication to train, which is critical to learning
> 10,000 classifiers per image.

A number of dense correspondence methods have
made use of discriminative pre-training [10, 15, 17],
with the recent work of [9] being particularly relevant
to our discussion. In this work, a classifier of the form
$f(\Phi(x_1) - \Phi(x_2))$ is trained to predict a (binary)
likelihood of two pixels matching. Intuitively, the
classifier learns the modes and scale of variation in the
underlying feature space $\Phi$ that are important and those
that are distractors. Training is fully supervised from
groundtruth optical flow data.

Like SIFT Flow, [9] formulate the correspondence
objective as a graphical model ([7, 8] respectively).
This has the distinct advantage over variational
methods of permitting very large displacements and
arbitrarily complex data terms, at the expense of requiring
simple regularizers to keep inference tractable. More
recently, a number of variational methods have used
sparse descriptor matching to anchor larger
displacements [2, 20]. While both methods use robust SIFT
descriptors for keypoint matching, in a semantic
correspondence setting the best match is infrequently the
true correspondence, leading to poor initialization of
the densification stage.

^1 Feature representation and similarity metric are intrinsically
related, since $f(\Phi(x_1), \Phi(x_2)) = g(x_1, x_2)$.



3. Dense Semantic Correspondence 

Given two images, $X_A$ and $X_B$, and a
discrete set of points $x$, dense semantic correspondence
involves solving the inverse fitting problem,

$$ x^* = \arg\min_x \sum_{i=1}^{MN} f_i(x_i) + \lambda g(x) $$    (1)

where $f$ is the unary function that evaluates the
likelihood of a particular assignment for each $x_i$ based on
the image content, and $g$ is a regularizer which enforces
constraints on the joint configuration of the points. In
semantic correspondence, the unary function must be
a good indicator of semantic similarity, and so must
be robust to significant intra-class variation. In the
framework we adopt, there are no constraints on its
complexity or properties.
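
As a rough illustration of how the objective in Equation (1) is evaluated for one candidate assignment, the following sketch sums a precomputed unary cost and a pairwise smoothness term. The regularizer is not specified at this point in the text, so the truncated $L_1$ penalty on neighbouring displacements used here is an assumption (in the spirit of SIFT Flow), and the names `correspondence_energy`, `lam` and `truncation` are hypothetical.

```python
import numpy as np

def correspondence_energy(unary_cost, flow, lam=1.0, truncation=5.0):
    """Evaluate sum_i f_i(x_i) + lam * g(x) for one candidate assignment.

    unary_cost: (H, W) array, f_i(x_i) already evaluated at the chosen x_i.
    flow:       (H, W, 2) array of displacements x_i - i.
    """
    # Assumed regularizer: truncated L1 difference of the displacement
    # between 4-connected neighbours (vertical, then horizontal pairs).
    dy = np.abs(flow[1:, :, :] - flow[:-1, :, :]).sum(axis=2)
    dx = np.abs(flow[:, 1:, :] - flow[:, :-1, :]).sum(axis=2)
    smoothness = np.minimum(dy, truncation).sum() + np.minimum(dx, truncation).sum()
    return float(unary_cost.sum() + lam * smoothness)

# A constant displacement field incurs no smoothness penalty.
unary = np.ones((3, 3))
smooth_flow = np.zeros((3, 3, 2))
energy_smooth = correspondence_energy(unary, smooth_flow, lam=0.5)
```

The truncation caps the penalty a single outlying point can contribute, which is one way a regularizer can stay simple while still tolerating discontinuities.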

SIFT Flow [11] adopts a unary of the form,

$$ f_i(x_i) = h(i, x_i) = \| \Phi_A(i) - \Phi_B(x_i) \|_1 $$    (2)

where $\Phi_A(x_i) = \Phi(x_i; X_A)$ is a feature representation
of the image $X_A$ evaluated at the point $x_i$.^2

In [9], the $L_1$ norm on the difference between
features is replaced with a more general learned
representation,

$$ h(i, x_i) = H(\Phi_A(i) - \Phi_B(x_i)) $$    (3)

In both formulations, however, the unary function is a
stationary kernel. This implies a feature space capable
of producing similar outputs for semantically similar
inputs. Finding such a feature embedding is a difficult
task in general, and as a result significant object
detection literature has focussed on learning classifiers to
distinguish classes instead.

The use of classifiers has two distinct advantages
over stationary kernels for describing match likelihood.
First, linear classifiers define half-spaces in which
samples are either classified as positive or negative. Thus
two points with dissimilar appearances can still be
afforded a high match likelihood. Second, the
importance of different dimensions in the feature space can
be learned from data.
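
The first advantage can be seen in a toy example. The 2-D features and the hand-picked hyperplane below are hypothetical values chosen for illustration, not learned from data: a candidate far from the reference under the $L_1$ distance still falls in the positive half-space, while a nearby distractor does not.

```python
import numpy as np

# Hypothetical 2-D features: a reference point and two candidates.
reference = np.array([1.0, 1.0])
similar_far = np.array([4.0, 5.0])       # far from the reference, same side of the hyperplane
distractor_near = np.array([1.0, -0.5])  # close to the reference, opposite side

# A hand-picked linear classifier w^T x = c (an assumption for illustration).
w = np.array([0.0, 1.0])
c = 0.0

def l1_distance(a, b):
    """Stationary-kernel style score: depends only on the difference a - b."""
    return float(np.abs(a - b).sum())

def classifier_score(x):
    """Signed response of the linear classifier; positive means 'match'."""
    return float(w @ x - c)

far_distance = l1_distance(reference, similar_far)
near_distance = l1_distance(reference, distractor_near)
far_score = classifier_score(similar_far)
near_score = classifier_score(distractor_near)
```

Here the classifier ignores the first feature dimension entirely, which also hints at the second advantage: the weighting of dimensions is part of the model rather than fixed by the metric.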

In this paper, we advocate a unary function of the
form,

$$ f_i(x_i) = h(i, x_i) = w_A(i)^T \Phi_B(x_i) $$    (4)

where $w_A(i)$ is a linear classifier trained to predict
correspondences to pixel $i$ in $X_A$, with ideal response,

(5)

^2 For our LDA classifiers, we extract features from a window
of pixels around $x_i$, but this detail can be subsumed into the
feature transform $\Phi$.


This is traditional binary classification, where the
positive class contains the reference pixel, and its true
correspondence in the target image, and the negative
class contains all other pixels. Since the
correspondence in the target image is not known a priori
however, we rely on the classifier $w_A(i)$ to generalize from
a single training example: $\Phi_A(i)$. This is known as
exemplar-based classification [12].

The challenge is how to rapidly estimate thousands 
of exemplar classifiers per image in reasonable time. 
The remainder of this section focuses on addressing 
that challenge, and a number of interesting properties 
that arise from our approach. 

3.1. Learning Detectors Rapidly using Structured 
Covariance Matrices 

Linear classifiers have a rich history in computer
vision, not least because of their interpretation and
efficient implementation as a convolution operation.
Support vector machines have proven particularly popular,
due to their elegant theoretical interpretation, and
impressive real-world performance, especially on object
and part detection tasks. A challenge for any object
detection problem is how to treat the potentially
infinite negative set (comprising all incorrect
correspondences in our case). Object detection methods using
support vector machines employ hard negative mining
strategies to search the negative set for difficult
examples, which can be represented parametrically in terms
of the decision hyperplane. This feature is also their
limitation for rapid estimation of many classifiers, since
each classifier must reparse the negative set looking for
hard examples - knowing one classifier does not help
in estimating another.^3

Linear Discriminant Analysis (LDA), on the other
hand, summarizes the negative set into its mean and
covariance. The parameters $w$ of the decision
hyperplane $w^T x = c$ are learned by solving the system of
equations,

$$ S w = b $$    (6)

where $S$ is the joint covariance of both classes and
$b = \bar{x}_{pos} - \bar{x}_{neg}$ is the difference between class means.
[3] made two key observations about LDA: (i) if the
number of positive examples is small compared to the
number of negative examples, the joint covariance $S$
can be approximated by the covariance of the negative
distribution alone, and reused for all positive classes,
and (ii) gathering and storing the covariance can be
performed efficiently if the negative class is shift
invariant (i.e. a translated negative example is still a
negative example).

^3 This is not strictly true. Warm starting an SVM from a
previous solution, especially in exemplar SVMs where only a
single positive example changes, can induce a significant empirical
speedup, but is unlikely to change the $O(\cdot)$ complexity of the
algorithm.


This second fact implies stationarity of the
negative distribution, where the covariance of two pixels is
defined entirely by their relative displacement.
Importantly, both [3] and [5] showed that the performance of
linear detectors learned by exploiting the stationarity
of the negative set is comparable to SVM training with
hard negative mining.
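
The stationarity assumption can be made concrete with a short sketch. It assumes a hypothetical tensor `g` indexed as `g[di + b, dj + b, p, q]` (bandwidth `b`, channels `p` and `q`), an indexing convention chosen here for illustration rather than taken from the paper, and assembles the dense covariance of a small window by looking up each pixel pair's relative displacement.

```python
import numpy as np

def covariance_from_displacements(g, height, width):
    """Assemble the dense covariance of a height x width x C window.

    g has shape (2*b + 1, 2*b + 1, C, C); g[di + b, dj + b, p, q] is the
    covariance between channel p at one pixel and channel q at the pixel
    offset from it by the displacement (di, dj).
    """
    b = g.shape[0] // 2
    channels = g.shape[2]
    if height - 1 > b or width - 1 > b:
        raise ValueError("window exceeds the bandwidth of g")
    dim = height * width * channels
    S = np.empty((dim, dim))
    for i in range(height):
        for j in range(width):
            row = (i * width + j) * channels
            for u in range(height):
                for v in range(width):
                    col = (u * width + v) * channels
                    # Only the relative displacement (i - u, j - v) matters.
                    S[row:row + channels, col:col + channels] = g[i - u + b, j - v + b]
    return S
```

Only `g` needs to be stored; the dense matrix for any window within the bandwidth can be rebuilt on demand.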

The covariance $S$ can be constructed from a relative
displacement tensor, according to,

$$ S_{(i,j,p),(u,v,q)} = g[i - u, j - v, p, q] $$    (7)

where $i, j, u, v$ index spatial co-ordinates, and $p, q$
index channels. We call the maximum displacement
observed, $|i - u|$, $|j - v|$, the bandwidth of the tensor.
Also note that stationarity only exists spatially -
cross-channel correlations are stored explicitly. The storage
of $g$ thus scales quadratically in both bandwidth and
channels, though since the detectors we consider are
typically small-support, we can entertain feature
representations with large numbers of channels (i.e. SIFT).

In order to compute $g$, we gather statistics across a
random subset of 50,000 images from ImageNet. We
precompute the covariance matrix of the chosen
detector size (typically 5 x 5) and factor it with either a
Cholesky decomposition, or its explicit inverse,
making sure the covariance is positive-definite by adding
the minimum of zero and the minimum eigenvalue to
the diagonal, i.e. $(S + \min(0, \lambda_{\min}) \cdot I)^{-1}$.

For each pixel in the reference image, we compute,

$$ w_A(i) = S^{-1} (\bar{x}_{pos} - \bar{x}_{neg}) $$    (8)

which involves a single vector subtraction and
matrix-vector multiplication, where,

$$ \bar{x}_{pos} = \Phi_A(i) $$    (9)

The likelihood estimate for the i-th reference point
across the target image can be performed via
convolution over the discretized pixel grid,

$$ f_i(x) = w_A(i) * \Phi_B(x) $$    (10)

Since storing the full unary is quadratic in the
number of image pixels (quartic in the dimension), we
perform coarse-to-fine or windowed search as per SIFT
Flow [11].

3.2. Posterior Probability Estimation

Linear Discriminant Analysis (LDA) has the
attractive property of generatively modelling classes as
Gaussian distributions with equal (co-)variance. This
permits direct computation of posterior probabilities via
application of Bayes’ Rule:

$$ P(C_{pos} \mid x) = \frac{p(x \mid C_{pos}) \, p(C_{pos})}{\sum_{n \in \{pos, neg\}} p(x \mid C_n) \, p(C_n)} $$    (11)

where,

$$ p(x \mid C_n) = \frac{1}{(2\pi)^{D/2} |S|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \bar{x}_n)^T S^{-1} (x - \bar{x}_n) \right) $$    (12)

with $D$ the dimensionality of $x$. With some manipulation,
the posterior of Equation (11) can be expressed as,

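$$ P(C_{pos} \mid x) = \frac{1}{1 + e^{-y}} $$    (13)

$$ y = x^T S^{-1} (\bar{x}_{pos} - \bar{x}_{neg}) $$    (14)

$$ \quad - \tfrac{1}{2} \bar{x}_{pos}^T S^{-1} \bar{x}_{pos} + \tfrac{1}{2} \bar{x}_{neg}^T S^{-1} \bar{x}_{neg} $$    (15)

$$ \quad + \ln \frac{p(C_{pos})}{p(C_{neg})} $$    (16)
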

Equation (13) takes the form of a logistic function,
which maps the domain $(-\infty, \infty)$ to the range
$(0, 1)$.

The logistic function is typically used to convert
SVM outputs to probabilistic estimates, however a
“calibration” phase is required to learn the bias and
variance of each SVM in the ensemble so their outputs
are comparable. With LDA, these parameters are
derived directly from the underlying distributions.

Equation (14) is the canonical response to the LDA
classifier. Equation (15) represents the bias of the
distributions, and Equation (16) is the ratio of prior
probabilities of the classes. This must be determined by
cross-validation (once, not for each classifier), based
on the desired sensitivity to true versus false positives.

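By completing the squares in Equation (15), we obtain
the final expression for computing the posterior
probability,

$$ y = (x - \tfrac{1}{2}(\bar{x}_{pos} + \bar{x}_{neg}))^T S^{-1} (\bar{x}_{pos} - \bar{x}_{neg}) + \beta $$
$$ \;\; = (x - \tfrac{1}{2}(\bar{x}_{pos} + \bar{x}_{neg}))^T w_i + \beta $$    (17)

where $\beta$ denotes the prior term of Equation (16).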
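
The step from Equations (14)-(16) to Equation (17) can be checked directly. Writing $a = \bar{x}_{pos}$ and $b = \bar{x}_{neg}$ as shorthand (names introduced here only for brevity) and using the symmetry of $S^{-1}$:

```latex
\begin{align*}
(a + b)^T S^{-1} (a - b)
  &= a^T S^{-1} a - a^T S^{-1} b + b^T S^{-1} a - b^T S^{-1} b \\
  &= a^T S^{-1} a - b^T S^{-1} b , \\
-\tfrac{1}{2} a^T S^{-1} a + \tfrac{1}{2} b^T S^{-1} b
  &= -\tfrac{1}{2} (a + b)^T S^{-1} (a - b) .
\end{align*}
```

Adding this bias to the response $x^T S^{-1} (a - b)$ gives the centred form, in which the classifier is evaluated on features shifted by the midpoint of the two class means.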

The implication of Equation (17) is that it is no
more expensive to compute probability estimates than
to just evaluate the classifier - the computation is
still dominated by the single matrix-vector product
required to learn the classifier.
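
A minimal sketch of this pipeline, assuming synthetic stand-ins for the pre-trained negative statistics (the covariance `S`, the negative mean `x_neg` and the exemplar feature `x_pos` below are random, hypothetical values, and the function names are illustrative). It trains one exemplar classifier with a single subtraction and matrix-vector product, then evaluates the posterior through the centred response and a logistic function.

```python
import numpy as np

def train_exemplar_lda(S_inv, x_pos, x_neg):
    """Return w = S^{-1} (x_pos - x_neg): one subtraction, one matrix-vector product."""
    return S_inv @ (x_pos - x_neg)

def posterior(x, w, x_pos, x_neg, log_prior_ratio=0.0):
    """P(C_pos | x): centred response plus the prior term, passed through a logistic."""
    y = (x - 0.5 * (x_pos + x_neg)) @ w + log_prior_ratio
    return 1.0 / (1.0 + np.exp(-y))

# Synthetic stand-ins for the pre-trained negative statistics (hypothetical values).
rng = np.random.default_rng(0)
dim = 6
A = rng.normal(size=(dim, dim))
S = A @ A.T + dim * np.eye(dim)      # symmetric positive-definite covariance
S_inv = np.linalg.inv(S)             # computed once, reused for every pixel
x_neg = np.zeros(dim)                # mean of the negative set
x_pos = rng.normal(size=dim)         # feature of one reference pixel

w = train_exemplar_lda(S_inv, x_pos, x_neg)
p_at_exemplar = posterior(x_pos, w, x_pos, x_neg)
p_at_negative_mean = posterior(x_neg, w, x_pos, x_neg)
```

Because `posterior` uses a trailing-axis product, passing a feature image of shape `(H, W, dim)` returns an `(H, W)` posterior map, which corresponds to a detector with 1 x 1 spatial support.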

Figure 2 illustrates a representative set of likelihood
estimates output by our method and SIFT Flow
respectively. LDA typically has tighter responses around
the true correspondence, and better suppression of false
positives, especially on background content that has no
clear correspondence.







Figure 2. From left to right: (a) reference image with reference point labelled in red, and posterior estimates for (b) LDA and
(c) $L_1$ norm. We present a range of points, from distinctive to indistinctive or background. LDA and $L_1$ norm have similar
likelihood quality for distinctive points, but LDA consistently offers better rejection of incorrect matches and background
content.


4. Evaluation 

In order to evaluate the efficacy of our method, we
first wanted to understand how well human
annotators perform at semantic labelling tasks. Since we are
primarily interested in estimating correspondences for
reconstruction-type objectives, we gathered 20 pairs
of images from visual object categories which exhibit
anatomical correspondence, including an assortment
of animals, trucks, faces and people. Given a set of
sparsely selected keypoints in the first image of each
pair, 8 human annotators were tasked with labelling
the corresponding points in the second image. A
representative subset of the data is shown in Figure 3.

A similar experiment was performed in [11], however
they focussed on correspondences across scenes, which
often have no clear correspondence, even for human
annotators. In contrast, the agreement on our dataset
is high, with a natural increase in uncertainty from
corner features, to edges and textureless regions.


In recognizing that not all features are equally
distinctive, we measure distance from estimated points $x_i$
to the groundtruth using Mahalanobis distance,

$$ d_i(x_i) = \sqrt{(x_i - \mu_i)^T \Sigma_i^{-1} (x_i - \mu_i)} $$    (18)

where $\mu_i$ and $\Sigma_i$ are the 2D mean and covariance of the
groundtruth labellings across annotators. [18] motivate
a similar procedure for human pose estimation. This
metric has two advantages over Euclidean distance: (i)
it takes into account spatial and directional uncertainty
(e.g. correspondences are afforded some slack along an
edge, but not perpendicular to it), and (ii) it is
resolution independent, since distance is measured in
standard deviations.
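
The metric can be sketched as follows, assuming a hypothetical set of eight annotator clicks for a single keypoint lying on a roughly horizontal edge (the coordinates are made up for illustration). The sample covariance captures the annotators' disagreement, so a prediction displaced along the edge is penalized less than one displaced across it.

```python
import numpy as np

def annotation_statistics(labels):
    """labels: (num_annotators, 2) array of (x, y) clicks for one keypoint."""
    mu = labels.mean(axis=0)
    sigma = np.cov(labels, rowvar=False)  # 2x2 sample covariance
    return mu, sigma

def mahalanobis(x, mu, sigma):
    """Distance from x to the annotation mean, in standard deviations."""
    diff = np.asarray(x, dtype=float) - mu
    return float(np.sqrt(diff @ np.linalg.solve(sigma, diff)))

# Hypothetical clicks: wide spread along x (the edge), narrow spread along y.
labels = np.array([
    [-4.0, 0.5], [-3.0, -0.5], [-2.0, 0.5], [-1.0, -0.5],
    [1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5],
])
mu, sigma = annotation_statistics(labels)
along_edge = mahalanobis([3.0, 0.0], mu, sigma)
across_edge = mahalanobis([0.0, 3.0], mu, sigma)
```

Rescaling both the annotations and the prediction by the same factor leaves the distance unchanged, which is the resolution independence noted above.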

Our dataset and metric therefore set a higher
standard for what is considered a good correspondence,
both empirically and qualitatively (since readers can
accurately discriminate good from poor results). All
results presented in the following section are measured
under this metric.






















Figure 3. A representative subset of the groundtruth dataset. From top to bottom: (a) the source images, (b) the target 
images, and (c) the distribution of points selected by the human annotators on the target images. The structure of the 
object is often clearly discernible from the annotations alone. 


4.1. Experiments 

In all of our experiments we resize the source (A) 
and target (B) image so max(M, N) = 150, preserving 
the aspect ratio, and extract densely sampled SIFT 
features. 

The stationary distribution (mean and covariance)
of SIFT features is estimated from 50,000 randomly
sampled images from ImageNet. Classifiers with
spatial support 1 x 1, 3 x 3, 5 x 5, 7 x 7 and 9 x 9 were
evaluated. The different sizes trade off speed, localization
accuracy and generalization. We found 5 x 5 classifiers
provided a good balance between these tradeoffs, and
the results throughout our paper use this support.

While the LDA likelihoods are more
computationally demanding to compute than $L_1$-norm likelihoods,
the construction and application of the classifiers can
be accelerated with BLAS. Estimating 10,000 5 x 5
classifiers and applying them in a sliding window
fashion to an 80 x 125 SIFT image (with 128 channels) takes
approximately 6 seconds.

We apply our LDA-based correspondence method
in the same graphical model framework as SIFT Flow.
We use a coarse-to-fine scheme to handle inference over
larger images, and grid searched the hyperparameters
for both LDA and $L_1$ based unary functions. Results
are shown in Figure 4.

We display the cumulative density for increasing
number of standard deviations from groundtruth (i.e.
fraction of points falling within an increasing radius
from groundtruth). As a baseline, we simply set
$x_i = i$,^4 which acts as a proxy to the global alignment
bias of the dataset (small flow assumption). In
addition to SIFT Flow, we also compare our method to a
leading optical flow method, Deep Flow [20].
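
The cumulative curve described above amounts to counting, for each radius, the fraction of keypoints whose distance to groundtruth does not exceed it. A small sketch, with hypothetical distances (in standard deviations) and an illustrative function name:

```python
import numpy as np

def cumulative_fraction(distances, radii):
    """Fraction of points whose distance to groundtruth is within each radius."""
    distances = np.asarray(distances, dtype=float)
    return np.array([(distances <= r).mean() for r in radii])

# Hypothetical per-keypoint distances, measured in standard deviations.
distances = [0.5, 1.5, 2.5, 10.0]
radii = [1.0, 2.0, 3.0]
curve = cumulative_fraction(distances, radii)
```

A single distant outlier, such as the last value here, lowers the plateau of the curve but does not affect the fractions at smaller radii.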

We truncate the CDF due to the long tails for all 
methods compared. This is an artefact of the non- 
global regularization schemes, which allow some points 
to be arbitrarily far from groundtruth without affecting 
others. Finally, in Figure 5 we illustrate a number of 
exemplar correspondences to show the visual quality of 
matches produced by our method. 

5. Conclusion 

In this paper we motivated the application of dense
semantic correspondence for a range of computer
vision problems which currently rely on synthetic data
or specialized imaging devices. In contrast to existing
correspondence methods, which typically use
similarity kernels, we proposed using exemplar classifiers for
describing the likelihood of two points matching. We
showed that LDA classifiers exhibit 3 desirable
properties: (i) higher average precision than simple measures
of image similarity such as the $L_1$ norm, (ii)
significantly faster training than exemplar SVMs, and (iii)
estimates of match confidence that are directly
comparable across pixels.

^4 For images of different sizes, we set $x_i = W(i)$ where $W$ is
a function that maps the span of $X_A$ to $X_B$.










[Figure 4 plots. Legend: LDA (ours), SIFT Flow, Deep Flow, Argmax, Baseline.]


Figure 4. Comparison of sparse keypoint localization for our method, SIFT Flow [11] and Deep Flow [20]. The baseline
measures the global alignment bias of the dataset (how well one would perform by simply assuming no flow). The argmax
considers taking the single best match without regularization. The graphs measure the fraction of correspondences which
fall within an increasing distance from groundtruth. A distance of 3 standard deviations is indistinguishable from human
annotator accuracy. From left to right: (a) aggregate results across all images, (b) the truck pair which our method localizes well, and
(c) the biking pair for which our method fails to produce any meaningful correspondences.



Figure 5. Example correspondences discovered by our method, across a broad range of image pairs from our dataset. The 
truck pair produces good localization of points (see Figure 4b), whilst the biking pair shows a failure to produce anything 
meaningful (see Figure 4c). 























We presented a small semantic correspondence
dataset and metric in a bid to measure the performance
of different methods in a quantifiable manner, and
showed that under this metric our classifier-based
approach offered improvements over the $L_1$ norm, within
the same SIFT Flow optimization framework. The
qualitative results illustrate our method’s ability to
estimate high-quality dense semantic correspondences.